HoPE: A Hybrid Approach to Position Embedding for Length Generalization in Vision-Language Models
Towards Robust Long-Context Understanding and Retrieval in Video-Comprehension Tasks
Authors: H. Li et al.
Published on Arxiv: 2025-05-26
Link: http://arxiv.org/abs/2505.20444v1
Institutions: Carnegie Mellon University • Xiaohongshu Inc.
Keywords: Vision-Language Models, Rotary Position Embedding, Hybrid Position Embedding, Multimodal Transformers, Video Understanding, Dynamic Temporal Scaling, Long-Context Modeling, Qwen2, Video Retrieval, Semantic Modeling
Recent progress in Vision-Language Models (VLMs) has enabled significant advances in multimodal tasks. However, these models tend to experience marked performance drops when dealing with long-context scenarios, such as the comprehension of lengthy videos. Rotary Position Embedding (RoPE), while effective for large language models, struggles to capture complex spatial-temporal dependencies in such multimodal and video-centric contexts. Moreover, current adaptations of RoPE often employ heuristic frequency allocation without robust theoretical support and face limitations in modeling long-range semantic relationships.
Expanding on these challenges, the article proposes an innovative solution with new approaches and contributions outlined as follows:
- Introduction of HoPE (Hybrid of Position Embedding) to overcome the limitations of RoPE in vision-language models targeting long-context tasks.
- Development of a hybrid frequency allocation (HFA) strategy, where higher frequencies are assigned to spatial (x, y) positions and the lowest frequencies are set to zero for temporal (t) dimensions, offering theoretical guarantees for modeling semantic relationships across extended contexts.
- Proposal of a dynamic temporal scaling (DTS) mechanism that allows adaptive compression or expansion of temporal indices, enhancing the model’s capability to handle variable video speeds and densities of information.
- Comprehensive theoretical analysis that demonstrates existing RoPE limitations and justifies the HoPE approach.
- Rigorous experimental validation using four video benchmarks (MLVU, LongVideoBench, Video-MME, V-NIAH) and strong VLM backbones (Qwen2-2B/7B-Video), comparing with baseline models such as vanilla RoPE, M-RoPE, and VideoRoPE.
- Extensive ablation studies to evaluate the impact of each component within HoPE.
These methodological innovations naturally lead to a discussion of achieved results, which are summarized below:
- HoPE consistently surpasses vanilla RoPE, M-RoPE, and VideoRoPE across all tested benchmarks and context lengths.
- Achieves a 2–4% improvement in long video understanding tasks, and up to 22.23% enhancement in long video retrieval over baseline models.
- Gains from HoPE increase with the size of the backbone model (comparing Qwen2-2B to Qwen2-7B backbones).
- Both hybrid frequency allocation and dynamic temporal scaling contribute meaningfully to the performance, with their combination yielding the most significant improvements.
- The dynamic scaling mechanism enhances robustness for varying video lengths and speeds, and introduces flexibility during inference.
Building on these results, the article concludes with the following key takeaways:
- Current multimodal RoPE strategies are suboptimal for long-context modeling in VLMs.
- HoPE’s theoretically motivated hybrid frequency allocation and dynamic temporal scaling enable substantial advances in robustness and length generalization for vision-language applications.
- HoPE achieves superior and scalable performance for long video comprehension and retrieval tasks.
- Future research directions may explore further scaling to even larger model architectures (such as 13B and 72B parameter backbones).